

working  paper 
department 
of  economics 


THE  ESTIMATION  OF  COMPLETE

AGGREGATION  STRUCTURES*

James  L.  Powell  and  Thomas  M.  Stoker

M.I.T.  Working  Paper  #346 
April  1984 


massachusetts 
institute  of 

technology 

50  memorial  drive 
Cambridge,  mass.  02139 




*  Department  of  Economics  and  Sloan  School  of  Management,  M.I.T.   The 
authors  gratefully  acknowledge  support  from  the  National  Science  Foundation 
for  this  research. 




THE  ESTIMATION  OF  COMPLETE  AGGREGATION  STRUCTURES 
James  L.  Powell  and  Thomas  M.  Stoker 


1.    Introduction

The  purpose  of  this  paper  is  to  establish  several  results  which  permit 
efficient  estimation  of  behavioral  parameters  in  models  of  aggregate  data. 
The  generic  econometric  situation  of  interest  occurs  when  one  observes  data 
aggregates  for  a  given  number  of  groups,  which  for  example  could  be  separate 
time  periods.   The  model  for  these  data  accounts  for  the  problem  of 
aggregation  over  individuals  -  it  is  constructed  from  a  parameterized 
microeconometric  model  integrated  over  an  assumed  distribution  form,  in  a  way
which  permits  identification  of  the  micro  behavioral  parameters.   The 
question  addressed  in  this  paper  is  how  one  can  consistently  and  efficiently 
estimate  the  values  of  these  behavioral  parameters,  using  only  the  aggregate 
data. 

The  issues  involved  in  constructing  models  which  account  for  the  problem 
of  aggregation  over  individuals  are  extensively  covered  in  Stoker  (1984). 
The  empirical  implications  of  individual  heterogeneity  and  distribution 
shifts  on  aggregate  data  are  formally  modeled  by  an  aggregate  function,  which 
is  constructed  from  a  microeconometric  model  integrated  over  an  assumed 
distribution  form.   Completeness  of  the  aggregation  structure  (or  the  absence 
of  an  aggregation  problem)  occurs  when  the  parameters  of  micro  behavior  are 
identified  by  the  aggregate  function,  which  requires  specific  restrictions  on 
the  form  of  micro  behavior  and/or  the  form  of  the  predictor  variable 
distribution.   Stoker  (1984)  presents  several  examples  of  aggregate  functions 
and  characterizing  results  for  complete  aggregation  structures,  which  include
models  with  intrinsically  nonlinear  microeconometric  equations  and  specific
restrictions  on  the  predictor  variable  distribution. 

In  this  paper  we  show  several  results  which  facilitate  the  use  of 
complete  aggregation  structure  modeling  to  study  observed  aggregate  data  from 
large  populations.   In  particular,  we  begin  by  showing  that  natural 
consistent  parameter  estimators  can  be  obtained  via  weighted  least  squares 
(WLS),  and  that  the  proper  choice  of  weights  provides  efficient  estimators,
so  that  the  best  WLS  estimator  is  first  order  efficient  in  the  much  broader
class  of  minimum  distance  estimators,  which  in  general  contains  maximum 
likelihood.   The  results  are  in  part  based  on  an  appropriate  modification  of 
the  principal  finding  of  the  theory  of  minimum  chi-square  estimation;  namely 
the  first  order  equivalence  of  minimum  chi-square  and  other  minimum  distance 
estimators.   As  part  of  the  exposition,  we  provide  examples  of  estimation 
problems  with  aggregate  data,  and  connect  the  optimal  selection  of  weights  to 
cross  section  regression  results  of  Stoker  (1983).

The  major  advantages  of  our  results  lie  in  their  generality  and 
computational  simplicity.   The  results  are  applicable  to  a  broad  range  of 
different  empirical  situations,  being  valid  for:   (i)  many  different  types  of 
aggregate  data  on  both  dependent  and  predictor  variables  -  averages,  medians, 
more  general  order  statistics,  etc.,  (ii)  virtually  arbitrary  forms  of  micro 
behavior  —  linear  or  nonlinear,  continuous  or  discrete  variables  —  and 
(iii)  general  specifications  of  the  predictor  variable  distribution.   The 
computational  simplicity  of  the  results  lies  in  the  first-order  optimality  of 
weighted  least  squares,  which  is  a  now  standard  (nonlinear)  estimation 
technique. 

We  begin  with  the  notation  and  basic  framework  in  Section  2.   Section  3
gives  conditions  under  which  the  weighted  least  squares  estimator  of  the
behavioral  parameters  is  consistent  and  asymptotically  normal,  discusses
estimation  of  the  asymptotic  covariance  matrix  of  the  estimator,  and  shows 
how  the  optimal  weighting  scheme  is  related  to  a  particular  cross-section 
regression.   Section  4  shows  the  first-order  equivalence  between  general 
minimum  distance  estimators  and  weighted  least  squares  estimators  with 
appropriately  chosen  weights.   Section  5  suggests  some  natural  extensions  of 
the  framework,  and  gives  some  concluding  remarks. 


2.   Definitions  and  Notation 

We assume throughout that the object of statistical analysis is the
estimation (with associated hypothesis tests) of an unknown parameter vector
$\delta$ characterizing a single behavioral equation,

(2.1)     $y_{it} = f(x_{it}, u_{it}, p_t; \delta)$,

where $y_{it}$ is a variable representing a behavioral response of an individual
agent $i$ in a group of agents indexed by $t$, $x_{it}$ is a $k$-vector of
(potentially) observable characteristics of that individual, and $u_{it}$ is a
disturbance embodying unobserved differences in agents not captured by
$x_{it}$; the components of $p_t$, which may be observable or unobservable,
represent variables which are common to all agents in a particular group but
which vary across groups. The probability law generating a particular
realization of $y_{it}$ is thus determined by the true value $\delta_0$ of the
unknown parameter $\delta$ and by the distribution of characteristics
$(x_{it}, p_t, u_{it})$ in the population being studied. In the aggregation
problems considered here, we assume that $x_{it}$ and $u_{it}$ are distributed
independently of $p_t$, and that the distribution of $(x_{it}, u_{it})$ within
each group $t$ has a density $h_t(x, u)$ which is absolutely continuous with
respect to some sigma-finite measure $\lambda$, invariant across groups;
further, we suppose that this density can be parameterized as

(2.2)     $h_t(x, u) = g_t(u \mid x, \sigma_0)\, p_t(x \mid \theta_t)$,

where the forms of $g_t(\cdot)$ and $p_t(\cdot)$ are known up to the unknown
parameters $\sigma_0$ and $\theta_t$.

This  setup,  while  not  completely  general  in  its  treatment  of  individual 
and  group  heterogeneity,  characterizes  a  broad  class  of  economic  models  of 
interest.   The  groups  denoted  by  t  may  represent  cross-section  aggregation 
units,  i.e.,  observations  for  particular  states,  countries,  or  industries  in 
a particular period of time, or the groups may correspond to populations
varying across time. When the group specific characteristics $\{p_t\}$ are
completely observable, the parameters $\theta_t$ determining the distribution
of the observable individual characteristics $x_{it}$ completely characterize
the unknown aspects of distributional differences across groups; throughout
the analysis below, we will assume that the $\{p_t\}$ are observable, and thus
rewrite the behavioral model as

(2.1')    $y_{it} = f_t(x_{it}, u_{it}; \delta_0)$,

with the sequence of functions $f_t$ indexed by the observable $p_t$.

If a random sample of $N_t$ observations on $(y_{it}, x_{it})$ were available
for each group $t=1,\ldots,T$, maximum likelihood estimation of the unknown
parameters $\delta_0$, $\sigma_0$ and $\theta_t$ would be feasible; because
$\{x_{it},\ i=1,\ldots,N_t\}$ is sufficient for each $\theta_t$ by the structure
imposed in (2.2), the principle of ancillarity would lead to estimation of the
parameters $\gamma_0 \equiv (\delta_0', \sigma_0')'$ through maximization of the
conditional likelihood function

(2.3)     $L = \sum_{t=1}^{T} \sum_{i=1}^{N_t} \log q_t(y_{it} \mid x_{it}, \gamma)$,

where $q_t(\cdot)$ is the conditional density of $y_{it}$ given $x_{it}$ and the
observable $p_t$. However, we suppose that these "micro samples" for each group
are not available for statistical analysis, perhaps due to prohibitive costs
of complete data reporting or confidentiality considerations for the
individuals sampled; instead, only aggregates $\hat Y_t$ and $\hat X_t$
representing the "typical" values of $y_{it}$ and $x_{it}$ are available for
each group. We further assume that the aggregates are constructed from a
random sample of $(y_{it}, x_{it}')$ within each group (for $i=1,\ldots,N_t$ and
$t=1,\ldots,T$) and are "democratic" to the extent that they depend only on the
respective empirical distribution functions, i.e. $\hat Y_t$ and $\hat X_t$ are
functionals of the form


(2.3)     $\hat Y_t = Y(\hat Q_t)$,  $\hat X_t = X(\hat P_t)$,

where

(2.4)     $\hat Q_t(y) \equiv \frac{1}{N_t} \sum_{i=1}^{N_t} 1(y_{it} \le y)$,
          $\hat P_t(x) \equiv \frac{1}{N_t} \sum_{i=1}^{N_t} 1(x_{it} \le x)$,

with "$1(A)$" denoting the indicator function of the event "$A$" and the
inequality "$x_{it} \le x$" being interpreted as holding
coordinate-by-coordinate. To these empirical aggregates $\hat Y_t$ and
$\hat X_t$ correspond population aggregates $Y(Q_t)$ and $X(P_t)$, where $Q_t$
and $P_t$ are the marginal distributions of $y_{it}$ and $x_{it}$ in the
$t$-th group; writing these explicitly as functions of the unknown parameters,
we have

(2.5)     $Y(Q_t) = Y(Q_t(y \mid \theta_t, \gamma_0)) \equiv \Lambda_t(\theta_t, \gamma_0)$   and
          $X(P_t) = X(P_t(x \mid \theta_t)) \equiv \Psi_t(\theta_t)$.

A key assumption in our approach is that $\Psi_t(\theta_t)$ is a known and
invertible function of $\theta_t$. This implies that we may reparameterize the
distribution of $x_{it}$ in group $t$ by $\mu_t \equiv \Psi_t(\theta_t)$, i.e.,
$p_t(x \mid \Psi_t^{-1}(\mu_t)) \equiv p_t(x \mid \mu_t)$; further, with an
assumption of Fisher consistency of the function $X(\cdot)$ (that is,
$X(\hat P_t) \to X(P_t)$ as $\hat P_t \to P_t$, where the latter convergence is
appropriately defined) the empirical aggregate $\hat X_t$ will consistently
estimate the only unknown source of variability of the joint distribution of
$y_{it}$ and $x_{it}$ across groups.

An important special case of this general framework occurs when $Y(\cdot)$ and
$X(\cdot)$ are expectations operators,

(2.6)     $\hat Y_t = \int y \, d\hat Q_t(y) = \bar Y_t$   and
          $\hat X_t = \int x \, d\hat P_t(x) = \bar X_t$.

Identification of the unknown parameters $\gamma_0$ in this case has been
extensively investigated in Stoker (1984), who relates identification of
$\gamma_0$ to completeness of the family $\{p_t(x \mid \theta)\}$ for $\theta$
in some parameter space $\Theta$. As Stoker's analysis demonstrates, the
assumption that $\Psi_t(\theta)$ is invertible, while a strong restriction on
the availability of aggregate data on characteristics, is essential if the
behavioral parameters $\gamma_0$ are to be estimable.

The general approach to aggregation adopted here also applies to order
statistic data. If $\hat Y_t$ is the $p$-th quantile ($p \times 100$th
percentile) of $\{y_{1t}, \ldots, y_{N_t t}\}$, then

(2.7)     $Y(\hat Q_t) = \hat Q_t^{-1}(p)$,

where, as usual, some rule for choice of a particular value of $\hat Y_t$ when
this inverse is set-valued must be specified. The population aggregate
associated with $\hat Y_t$ is then

(2.8)     $Y(Q_t) = Q_t^{-1}(p \mid \gamma_0, \theta_t)$,

which is uniquely defined when $Q_t$ is continuously differentiable with
positive density in a neighborhood of $m(p) = \inf\{y : Q_t(y) \ge p\}$.
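The "democratic" aggregates described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function names and the income figures are hypothetical, and the tie-breaking rule in `quantile_functional` is one possible choice for the set-valued inverse in (2.7), not necessarily the one the authors have in mind.

```python
import math

def edf(sample):
    """Return the empirical distribution function of a one-dimensional sample."""
    ordered = sorted(sample)
    n = len(ordered)

    def q_hat(y):
        # Share of observations with value <= y, as in (2.4).
        return sum(1 for v in ordered if v <= y) / n

    return q_hat

def mean_functional(sample):
    # Integral of y with respect to the EDF, i.e. the sample mean in (2.6).
    return sum(sample) / len(sample)

def quantile_functional(sample, p):
    # Smallest observation y with EDF(y) >= p; this is one possible rule
    # for picking a value when the inverse in (2.7) is set-valued.
    ordered = sorted(sample)
    k = max(1, math.ceil(p * len(ordered)))
    return ordered[k - 1]

# Hypothetical micro data for a single group t.
incomes = [12.0, 7.5, 30.0, 18.0, 9.0]
q_hat = edf(incomes)
mean_income = mean_functional(incomes)
median_income = quantile_functional(incomes, 0.5)
```

Both aggregates depend on the data only through the empirical distribution function, which is the property the framework requires.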

With these preliminaries, we can now turn to the central question of this
paper, namely, estimation of the parameter vector $\gamma_0$ when only $T$
aggregates $\{(\hat Y_t, \hat X_t'),\ t=1,\ldots,T\}$ are available for
analysis. One suggestive approach would be estimation by the method of maximum
likelihood; that is, given the functional forms of the behavioral function in
(2.1), the density of the characteristics in (2.2), and the form of the
aggregation functionals $Y(\cdot)$ and $X(\cdot)$ in (2.3), the joint density
function $d_t(y, x \mid \gamma, \mu_t, N_t)$ of the induced probability law for
$\hat Y_t$ and $\hat X_t$ can be calculated, and $\gamma_0$ and
$\mu = (\mu_1', \ldots, \mu_T')'$ can be estimated by

(2.9)     $(\hat\gamma_{ML}, \hat\mu_{ML}) = \arg\max_{\gamma,\mu} \sum_{t=1}^{T} \log d_t(\hat Y_t, \hat X_t \mid \gamma, \mu_t, N_t)$.

Upon further consideration, though, two drawbacks in this strategy are
apparent. First, calculation of the joint density function $d_t$ of $\hat Y_t$
and $\hat X_t$ is far from trivial in general; for example, if
$\hat Y_t = \bar Y_t$ and $\hat X_t = \bar X_t$, computation of $d_t(\cdot)$
involves an $N_t$-fold convolution of the joint density of
$(y_{it}, x_{it}')$, which may be intractable if $f_t$ and $h_t$ of (2.1) and
(2.2) are not of simple form. Another, more fundamental issue is the
applicability of the usual justification for maximum likelihood estimation in
the present circumstance. The usual large-sample properties of maximum
likelihood (consistency, asymptotic normality, and asymptotic efficiency of
the estimator) could be demonstrated (under suitable regularity conditions)
for $T$ tending to infinity as the group size $N_t$ is held fixed for each
group $t$; however, for typical problems involving aggregates, the group sizes
$N_t$ are large relative to $T$. A more appropriate asymptotic theory then
would take $N_t$ tending to infinity for fixed $T$; the likelihood function,
rather than being a sum of an increasing number of log density functions,
would be a fixed sum of log density functions, each of which varies as $N_t$
increases. Since the standard asymptotic results on maximum likelihood
estimation do not apply in this case (which we assume here), and since
calculation of the density functions is problematic, this does not appear to
be the most promising approach to estimation with aggregate data.

A more attractive approach is based upon estimation of the population
aggregates in (2.5). Reparameterizing by $\mu_t = \Psi_t(\theta_t)$, we can
rewrite (2.5) as

(2.5')    $Y(Q_t) = \Lambda_t(\Psi_t^{-1}(\mu_t), \gamma_0) \equiv \Phi_t(\mu_t, \gamma_0)$   and
          $X(P_t) = \mu_t$,

so that

(2.10)    $Y(Q_t) = \Phi_t[X(P_t), \gamma_0]$.

This latter relation, plus the assumed consistency of $\hat Y_t$ and
$\hat X_t$ for $Y(Q_t)$ and $X(P_t)$, suggests estimation of $\gamma_0$ by
minimizing the average "distance" between $\hat Y_t$ and
$\Phi_t(\hat X_t, \gamma)$ over $\gamma$, where the average is taken over the
number of groups $T$. A convenient measure of distance is a weighted squared
difference; thus, we will consider estimation of $\gamma_0$ by

(2.11)    $\hat\gamma_w \equiv \arg\min_{\gamma} \sum_{t=1}^{T} \hat w_t \left(\hat Y_t - \Phi_t(\hat X_t, \gamma)\right)^2$,

where $\{\hat w_t\}$ is some set of (possibly stochastic) nonnegative weights.
The large sample properties of this estimator will be derived in the following
section.
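To make the estimator in (2.11) concrete, the following is a minimal sketch for a scalar parameter. Everything specific in it is an assumption made for illustration: the aggregate function `phi` (a power function), the aggregate values, the weights, and the use of a golden-section search on a compact interval are ours, not the paper's. The paper only requires some standard nonlinear optimizer.

```python
import math

def wls_objective(gamma, y_hat, x_hat, weights, phi):
    # Weighted sum of squared gaps between observed aggregates and phi.
    return sum(w * (y - phi(x, gamma)) ** 2
               for y, x, w in zip(y_hat, x_hat, weights))

def golden_section_argmin(func, lo, hi, tol=1e-10):
    # Derivative-free search on a compact interval [lo, hi];
    # assumes func has a single interior minimum there.
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - ratio * (b - a)
    d = a + ratio * (b - a)
    while b - a > tol:
        if func(c) < func(d):
            b, d = d, c
            c = b - ratio * (b - a)
        else:
            a, c = c, d
            d = a + ratio * (b - a)
    return 0.5 * (a + b)

def phi(mu, gamma):
    # Hypothetical aggregate function, used only for this illustration.
    return mu ** gamma

# Hypothetical aggregates for T = 3 groups, generated without noise
# from gamma_0 = 0.7 so that the minimizer is known in advance.
gamma_0 = 0.7
x_hat = [2.0, 3.0, 5.0]
y_hat = [phi(x, gamma_0) for x in x_hat]
weights = [1.0, 2.0, 0.5]

gamma_w = golden_section_argmin(
    lambda g: wls_objective(g, y_hat, x_hat, weights, phi), 0.0, 2.0)
```

With noise-free aggregates the minimizer coincides with `gamma_0` whatever nonnegative weights are used; the choice of weights matters only for precision once sampling error is present.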

This approach is quite closely related to minimum chi-square and related
minimum distance estimation methods for multinomial models (see, for example,
Rao (1973), or Bishop, Fienberg, and Holland (1975)). Nonetheless, the
analysis here differs in some aspects, most notably in the generality of the
aggregates $\hat Y_t$ and $\hat X_t$ considered. Furthermore, unlike the usual
analysis of cell count data, it is not necessary to suppose that the
characteristics within each group (which are analogous to, say, dosage levels
in bio-assay) are either fixed for all observations in that group or known
with certainty; it suffices here that consistent estimates of the parameters
governing the differences in the distributions of characteristics across
groups are available.

Still, one of the principal findings of the theory of minimum chi-square
estimation (namely, the first-order equivalence of minimum chi-square,
maximum likelihood, and other minimum distance estimators using cell
proportions) carries over, with some modification, to the aggregate
estimation framework considered here. In Section 4 below, conditions are
given under which the weighted least squares estimator $\hat\gamma_w$ defined
in (2.11) above will be first order efficient (with appropriately chosen
weights) among minimum distance estimators, including the maximum likelihood
estimator of (2.9). In that section we also consider the relation of the type
of available aggregation functionals $Y$ and $X$ to the precision of the
weighted least squares estimator of $\gamma_0$.

Before  turning  to  the  large  sample  properties  of  weighted  least  squares 
estimation,  we  present  examples  of  aggregate  estimation  problems  for  two 
particular  economic  models. 

Example 1:  (Continuous Demand Model)

Suppose that $y_{it}$ represents the expenditure on a given commodity, which is
determined as

          $y_{it} = f(x_{it}, u_{it}, p_t; \delta)$
                  $= A(p_t, \delta)\, x_{it} + B(p_t, \delta)\, x_{it} \log x_{it} + u_{it}$,

where $p_t$ is a vector of relative prices, presumed constant and observable
for each group $t$, $x_{it}$ is individual income (expenditures on all
commodities), and $u_{it}$ is a disturbance term, assumed to have zero mean
conditional on $x_{it}$ and variance $\sigma^2$ (assumed fixed across
individuals and groups for simplicity). This demand function is of the PIGLOG
form of Muellbauer (1975), and includes the AIDS system of Deaton and
Muellbauer (1980) and the translog model of Jorgenson, Lau, and Stoker (1982)
without attributes; $\delta$ represents individual preference parameters. We
take $\delta = \gamma$.

Supposing $\hat Y_t = \bar Y_t = \frac{1}{N_t} \sum_{i=1}^{N_t} y_{it}$ is the
aggregate available for consumption, and noting that

          $E(y_{it} \mid x_{it}, p_t, \gamma) = A(p_t, \gamma)\, x_{it} + B(p_t, \gamma)\, x_{it} \log x_{it}$,

we obtain the population aggregate $Y(Q_t)$ by taking the expectation of this
formula with respect to the distribution of $x_{it}$. Again for simplicity we
suppose the income distribution is log-normal, i.e., the density of $x_{it}$
(with respect to Lebesgue measure) is

          $p(x \mid \theta_t, \tau_t) = \frac{1(x>0)}{\sqrt{2\pi}\,\tau_t\, x} \exp\left[ -\frac{(\ln x - \theta_t)^2}{2\tau_t^2} \right]$.

Since this distribution has two time-varying parameters, we require two
aggregates for the characteristics to identify $\gamma$. If

          $\hat X_t = \left( \frac{1}{N_t} \sum_{i=1}^{N_t} x_{it},\ \mathrm{median}\{x_{it},\ i=1,\ldots,N_t\} \right)$

is available, the corresponding population aggregate characteristics for the
$t$-th group are

          $X(P_t) = \left( E(x_{it}),\ P_t^{-1}(1/2) \right) = \left( \exp(\theta_t + \tfrac{1}{2}\tau_t^2),\ \exp(\theta_t) \right)$,

which is clearly an invertible function of $(\theta_t, \tau_t)$.
For this distribution of characteristics,

          $Y(Q_t) = E(y_{it}) = \Lambda_t(\theta_t, \tau_t, \gamma)$

or, in terms of $\mu_t = (\mu_{1t}, \mu_{2t})$,

          $Y(Q_t) = \mu_{1t} \left[ A(p_t, \gamma) + B(p_t, \gamma)\,(2 \log \mu_{1t} - \log \mu_{2t}) \right] \equiv \Phi_t(\mu_t, \gamma)$.

With these calculations, computation of the weighted least squares estimator
$\hat\gamma_w$ of (2.11) for given weights $\{\hat w_t\}$ can be carried out
with standard nonlinear optimization techniques.
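The aggregate function in Example 1 can be checked numerically. The sketch below is ours, not the authors': the parameter values and function names are hypothetical, and the coefficients `a` and `b` stand in for $A(p_t, \gamma)$ and $B(p_t, \gamma)$ at a fixed price vector. It compares the closed form in terms of the mean and median with a direct quadrature of $E[A x + B x \log x]$ under the log-normal income distribution.

```python
import math

def lognormal_aggregates(theta, tau):
    # Population mean and median of a log-normal(theta, tau^2) variable.
    return math.exp(theta + 0.5 * tau ** 2), math.exp(theta)

def invert_aggregates(mu1, mu2):
    # Recover (theta, tau) from (mean, median); requires mu1 > mu2 > 0.
    theta = math.log(mu2)
    tau = math.sqrt(2.0 * (math.log(mu1) - math.log(mu2)))
    return theta, tau

def phi_demand(mu1, mu2, a, b):
    # Aggregate PIGLOG demand written in terms of the mean and median.
    return mu1 * (a + b * (2.0 * math.log(mu1) - math.log(mu2)))

def mean_by_quadrature(func, theta, tau, half_width=10.0, steps=20000):
    # Trapezoid rule for E[func(x)], x = exp(theta + tau * z), z ~ N(0, 1).
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        z = -half_width + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += weight * func(math.exp(theta + tau * z)) * density
    return total * h

# Hypothetical parameter values for one group.
theta, tau = 1.0, 0.5
a, b = 0.3, 0.05

mu1, mu2 = lognormal_aggregates(theta, tau)
theta_back, tau_back = invert_aggregates(mu1, mu2)
closed_form = phi_demand(mu1, mu2, a, b)
direct = mean_by_quadrature(lambda x: a * x + b * x * math.log(x), theta, tau)
```

The agreement rests on the log-normal identity $E(x \log x) = (\theta_t + \tau_t^2) \exp(\theta_t + \tfrac{1}{2}\tau_t^2)$ together with $\theta_t + \tau_t^2 = 2 \log \mu_{1t} - \log \mu_{2t}$.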

Example  2:   (Discrete  Choice  Model) 

Suppose now that $y_{it}$ is 1 or 0 according to whether a commodity (say a
car) is purchased or not. In this setup, the behavioral model may be of the
form

          $y_{it} = f(x_{it}, u_{it}, p_t; \delta)$
                  $= 1(u_{it} \le \delta_1 + \delta_2 \log x_{it} + \delta_3 p_t)$,

where $x_{it}$ denotes income, $p_t$ denotes the commodity price and $u_{it}$
represents individual heterogeneity, distributed normally with mean 0 and
variance 1. Again supposing the aggregate $\hat Y_t = \bar Y_t$, the sample
proportion of individuals in group $t$ who purchase the commodity, and that
$x_{it}$ is lognormally distributed,

          $p(x \mid \theta_t, \tau_t) = \frac{1(x>0)}{\sqrt{2\pi}\,\tau_t\, x} \exp\left[ -\frac{(\log x - \theta_t)^2}{2\tau_t^2} \right]$,

the population aggregate $Y(Q_t)$ can be written as

          $Y(Q_t) = F\left( \frac{\delta_1 + \delta_2 \theta_t + \delta_3 p_t}{\sqrt{1 + \delta_2^2 \tau_t^2}} \right)$,

where $F$ is the standard normal cumulative, as shown by McFadden and Reid
(1975). In terms of the previous characteristic aggregates
$\mu_t = (E(x_{it}),\ P_t^{-1}(1/2)) = (\mathrm{mean}(x_{it}),\ \mathrm{median}(x_{it}))$




3.    Large  Sample  Properties  of  Weighted  Least  Squares 

3.1   Consistency  and  Asymptotic  Normality 

For  convenience,  we  summarize  the  discussion  of  the  previous  section  in 
the  following  assumptions: 

Assumption A1:   For each group $t$ ($t=1,\ldots,T$), $N_t$ underlying
observations $(y_{it}, x_{it}', u_{it})$, $i=1,\ldots,N_t$, are generated from
the behavioral relation

          $y_{it} = f_t(x_{it}, u_{it}; \delta_0)$

and i.i.d. draws of $(x_{it}', u_{it})$ according to the probability law

          $\Pr((x_{it}', u_{it}) \in A) = \int 1((x', u) \in A)\, g_t(u \mid x, \sigma_0)\, p_t(x \mid \theta_t)\, d\lambda$

for $\lambda$-measurable $f_t(\cdot)$ and $A$, and for
$\gamma_0 = (\delta_0', \sigma_0')'$ a fixed, finite dimensional vector of
unknown parameters.

Assumption A2:   For each group $t = 1,\ldots,T$, the individual observations
are summarized by empirical aggregates $\hat Y_t = Y(\hat Q_t)$ and
$\hat X_t = X(\hat P_t)$, where $Y$ and $X$ are functionals defined on the
space of distribution functions of $y_{it}$ and $x_{it}$, respectively, and
$\hat Q_t$ and $\hat P_t$ are empirical distribution functions for
$(y_{it},\ i = 1,\ldots,N_t)$ and $(x_{it},\ i = 1,\ldots,N_t)$, respectively.

Assumption A3:   The population aggregates $Y(Q_t)$ and $X(P_t)$ satisfy the
relation

          $Y(Q_t) = \Phi_t(\mu_t, \gamma_0)$,

where the functional form of $\Phi_t$ is known for each $t$, and where $Q_t$
and $P_t$ are the marginal d.f.'s of $y_{it}$ and $x_{it}$ in group $t$.

To demonstrate strong consistency of the weighted least squares estimator
$\hat\gamma_w$ of (2.11) as $N_t \to \infty$, we impose further regularity
conditions on the unknown parameters, the aggregation functionals, and the
group weights.


Assumption A4:   The parameter vector $\gamma_0$ is an element of a compact
parameter space $\Gamma$.

Assumption A5:   The aggregation functionals $Y$ and $X$ are Fisher consistent
at $Q_t$ and $P_t$ for each $t$; i.e., if $\hat Q_t$ and $\hat P_t$ converge
weakly to $Q_t$ and $P_t$, then $Y(\hat Q_t) \to Y(Q_t)$ and
$X(\hat P_t) \to X(P_t)$.

Assumption A6:   For each $t$, the function $\Phi_t(\mu, \gamma)$ is continuous
in $\gamma$ for any $\mu$, and $\Phi_t$ is continuous at $\mu_t$, uniformly in
$\gamma \in \Gamma$, i.e.,

          $\sup_{\gamma \in \Gamma} \left| \Phi_t(\mu, \gamma) - \Phi_t(\mu_t, \gamma) \right| \to 0$  as  $\|\mu - \mu_t\| \to 0$.

Assumption A7:   The (possibly stochastic) weights $\{\hat w_t\}$ converge
almost surely (as $N_t \to \infty$) to some nonnegative constants $\{w_t\}$.

Assumption A8:   If $\gamma \ne \gamma_0$, $\gamma \in \Gamma$, then
$\sum_{t=1}^{T} w_t \left( \Phi_t(\mu_t, \gamma) - \Phi_t(\mu_t, \gamma_0) \right)^2 > 0$,
where the weights $\{w_t\}$ are given in A7.

Considering each of these conditions in turn, assumption A4 is typically
imposed in such nonlinear estimation problems as considered here; with the
continuity of $\Phi_t$ in $\gamma$ (given in A6), it ensures the existence and
measurability of $\hat\gamma_w$, and is imposed in the lemma we use to show
strong consistency. Assumption A5 and the uniform convergence (with
probability one) of $\hat Q_t$ and $\hat P_t$ to $Q_t$ and $P_t$ imply the
almost sure convergence of $\hat Y_t$ and $\hat X_t$ to their respective
population aggregates, as noted by Rao (1973, p. 346). Conditions A6 and A7
allow us to demonstrate the uniform almost-sure convergence of the WLS
minimand to the sum given in assumption A8; the identification condition in A8
is a stronger version of the notion of a "complete aggregation structure"
introduced in Stoker (1984) (which, in this context, would require only that
$\Phi_t(\mu, \gamma_1) - \Phi_t(\mu, \gamma_2)$ not be identically zero in
$\mu$ for any $\gamma_1 \ne \gamma_2$).

With these preliminary conditions, strong consistency of $\hat\gamma_w$ is
easily demonstrated.


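Theorem 3.1:   Under A1 - A8, the estimator $\hat\gamma_w$ of $\gamma_0$ is
strongly consistent, i.e., if

          $N \equiv \min\{N_t,\ t=1,\ldots,T\}$   and

          $\hat\gamma_w \equiv \arg\min_{\gamma \in \Gamma} \sum_{t=1}^{T} \hat w_t \left( \hat Y_t - \Phi_t(\hat X_t, \gamma) \right)^2$
                        $\equiv \arg\min_{\gamma \in \Gamma} S_N(\gamma)$,

then $\hat\gamma_w \to \gamma_0$ a.s. as $N \to \infty$.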

Proof:   As noted above, A2, A5, and the Glivenko-Cantelli lemma imply
$\hat X_t \to \mu_t$ and $\hat Y_t \to \Phi_t(\mu_t, \gamma_0)$ a.s. as
$N \to \infty$, and $\hat w_t \to w_t$ by assumption A7. Writing

(3.1)     $S(\gamma) \equiv \sum_{t=1}^{T} w_t \left( \Phi_t(\mu_t, \gamma_0) - \Phi_t(\mu_t, \gamma) \right)^2$,

we clearly have

(3.2)     $\sup_{\gamma \in \Gamma} \left| S_N(\gamma) - S(\gamma) \right| \to 0$  a.s. as $N \to \infty$,

since, for example,

          $\sup_{\gamma \in \Gamma} \sum_{t=1}^{T} \hat w_t \left( \Phi_t(\hat X_t, \gamma) - \Phi_t(\mu_t, \gamma) \right)^2 \to 0$  a.s.,

by assumption A6 and the strong consistency of $\hat X_t$ and $\hat w_t$.
Because $\gamma_0$ uniquely minimizes $S(\gamma)$ by assumption A8, Lemma 2 of
Amemiya (1973) yields the almost sure convergence of $\hat\gamma_w$ to
$\gamma_0$.


Before turning to conditions for asymptotic normality of $\hat\gamma_w$, we
should reiterate the essential difference between this approach and the usual
demonstration of consistency of extremum estimators. By taking $T$ fixed as
$N \to \infty$, we can write the estimator $\hat\gamma_w$ as a fixed,
continuous function of $\{(\hat Y_t, \hat X_t', \hat w_t)\}$ for suitably large
$N$, so consistency of $\hat\gamma_w$ follows from consistency of these
aggregates and the properties of the function defining $\hat\gamma_w$. In
contrast, the usual approach would have the minimand $S_T$ and the resulting
estimator $\hat\gamma_T$ varying with the sample size $T$, requiring stronger
conditions (such as bounded moments of the aggregates $\hat Y_t$ and
$\hat X_t$) to establish strong convergence. Indeed, taking $N_t$ fixed for all
$t$ and assuming $T \to \infty$, we would not expect $\hat\gamma_w$, as defined
above, to be consistent if $\Phi_t$ is nonlinear in $\hat X_t$. The appropriate
regression function in the latter case would not be $\Phi_t$ but
$E_t(\hat Y_t \mid \hat X_t)$, which, in addition to the computational problems
noted in Section 2, would in general involve the distribution parameters
$\mu_t$ if the behavioral function $f_t$ of A1 were nonlinear in
$(x_{it}, u_{it})$. Consistent estimation in this context would appear to
require an instrumental variables approach or strong restrictions on the
incidental parameters $\{\mu_t\}$.
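A small worked case may help to see why fixed group sizes cause trouble when the aggregate function is nonlinear in the characteristic aggregate. The specification below is purely our own illustration and does not appear in the paper: we take $y_{it} = \gamma_0 x_{it}^2$ with no disturbance, $x_{it}$ exponentially distributed with mean $\mu_t$ (so that $E(x_{it}^2) = 2\mu_t^2$ and $\mathrm{Var}(x_{it}) = \mu_t^2$), and sample means as both aggregates.

```latex
\begin{align*}
\Phi_t(\mu_t,\gamma) &= \gamma\,E(x_{it}^2) = 2\gamma\mu_t^2, \\
E\big[\Phi_t(\bar X_t,\gamma_0)\big]
  &= 2\gamma_0\,E(\bar X_t^2)
   = 2\gamma_0\Big(\mu_t^2 + \frac{\mu_t^2}{N_t}\Big), \\
E\big[\bar Y_t - \Phi_t(\bar X_t,\gamma_0)\big]
  &= 2\gamma_0\mu_t^2 - 2\gamma_0\mu_t^2\Big(1 + \frac{1}{N_t}\Big)
   = -\frac{2\gamma_0\mu_t^2}{N_t}.
\end{align*}
```

Under these assumptions the residual evaluated at the true parameter has a nonzero mean of order $1/N_t$. It vanishes as $N_t \to \infty$, which is the asymptotic regime adopted in this paper, but not when $N_t$ is held fixed and only $T$ grows.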

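To show asymptotic normality of the estimator $\hat\gamma_w$, we further
restrict the model with the following conditions:

Assumption A9:   The parameter vector $\gamma_0$ is an interior point of the
parameter space $\Gamma$.

Assumption A10:   The empirical aggregates $\hat Y_t$ and $\hat X_t$ are of the
form

          $\hat Y_t = \Phi_t(\mu_t, \gamma_0) + \frac{1}{N_t} \sum_{i=1}^{N_t} \xi_t(y_{it}, \mu_t, \gamma_0) + o_p(1/\sqrt{N_t})$
                    $\equiv \Phi_t(\mu_t, \gamma_0) + \frac{1}{N_t} \sum_{i=1}^{N_t} \xi_{it} + o_p(1/\sqrt{N_t})$

and       $\hat X_t = \mu_t + \frac{1}{N_t} \sum_{i=1}^{N_t} \zeta_t(x_{it}, \mu_t) + o_p(1/\sqrt{N_t})$
                    $\equiv \mu_t + \frac{1}{N_t} \sum_{i=1}^{N_t} \zeta_{it} + o_p(1/\sqrt{N_t})$,

with      $E(\xi_{it}, \zeta_{it}')' = 0$

and       $\mathrm{Var}(\xi_{it}, \zeta_{it}')' \equiv \Sigma_t = \begin{pmatrix} \sigma_{yy} & \Sigma_{yx} \\ \Sigma_{xy} & \Sigma_{xx} \end{pmatrix}$.

Assumption A11:   The function $\Phi_t(\mu, \gamma)$ is continuously
differentiable at $\mu = \mu_t$, $\gamma = \gamma_0$ for each $t$.

Assumption A12:   For $N = \min\{N_t,\ t=1,\ldots,T\}$,

          $\lim_{N \to \infty} (N/N_t) = \kappa_t \in (0, 1]$  for each $t$.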
N^- 


Of these additional conditions, only A10 requires further elaboration.
This assumption requires the aggregation functionals $Y$ and $X$ to be
sufficiently regular to admit a Taylor's series expansion at $Q_t$ and $P_t$;
the functions $\xi_t(y, \mu_t, \gamma_0)$ and $\zeta_t(x, \mu_t)$ are then the
influence curves associated with $Y$ and $X$ (see Huber [1982], Chapter 1).
When $Y(\cdot)$ and $X(\cdot)$ are expectation operators, the representation in
A10 is trivial, while if $\hat Y_t = \mathrm{median}\{y_{it},\ i=1,\ldots,N_t\}$,
for example, it can be shown that

(3.3)     $\hat Y_t = \Phi_t(\mu_t, \gamma_0) + \frac{1}{N_t} \sum_{i=1}^{N_t} \left[ 2\,\phi_t(\mu_t, \gamma_0) \right]^{-1} \mathrm{sgn}\left( y_{it} - \Phi_t(\mu_t, \gamma_0) \right) + o_p(1/\sqrt{N_t})$

if $y_{it}$ is continuously distributed with positive density at
$\Phi_t(\mu_t, \gamma_0) = Q_t^{-1}(1/2)$, where $\phi_t(\mu_t, \gamma_0)$ is
the density evaluated at $\Phi_t(\mu_t, \gamma_0)$.
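The median representation in (3.3) implies that $N_t$ times the variance of the sample median approaches $1/(4\phi^2)$, where $\phi$ is the density at the population median. The simulation below is a rough check of that implication under assumptions of our own choosing: standard normal data, a group size of 401, 2000 replications, and a fixed seed. None of these values come from the paper.

```python
import math
import random
import statistics

def scaled_median_variance(n_obs, n_reps, seed=0):
    # Monte Carlo estimate of N * Var(sample median) for N(0, 1) data.
    rng = random.Random(seed)
    medians = []
    for _ in range(n_reps):
        draws = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        medians.append(statistics.median(draws))
    return n_obs * statistics.pvariance(medians)

# Variance implied by the influence curve sgn(y - m) / (2 * density(m)):
# the sign term has variance one, so the implied value is 1 / (4 * density^2).
density_at_median = 1.0 / math.sqrt(2.0 * math.pi)
influence_variance = 1.0 / (4.0 * density_at_median ** 2)   # equals pi / 2

simulated = scaled_median_variance(n_obs=401, n_reps=2000)
```

The simulated value should sit close to $\pi/2$, with the remaining gap reflecting simulation noise and the finite group size.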


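Theorem 3.2:   Under A1 - A12, the weighted least squares estimator
$\hat\gamma_w$ is asymptotically normal,

          $\sqrt{N}\,(\hat\gamma_w - \gamma_0) \xrightarrow{d} N(0,\ M^{-1} V M^{-1})$  as $N \to \infty$,

where

          $M \equiv \sum_{t=1}^{T} w_t \left. \frac{\partial \Phi_t}{\partial \gamma} \frac{\partial \Phi_t}{\partial \gamma'} \right|_{\gamma_0, \mu_t}$,

          $V \equiv \sum_{t=1}^{T} w_t^2\, \kappa_t\, (\nu_t' \Sigma_t \nu_t) \left. \frac{\partial \Phi_t}{\partial \gamma} \frac{\partial \Phi_t}{\partial \gamma'} \right|_{\gamma_0, \mu_t}$,

for $w_t$, $\kappa_t$, and $\Sigma_t$ defined above and

          $\nu_t' \equiv \left( 1,\ -\left. \frac{\partial \Phi_t}{\partial \mu'} \right|_{\gamma_0, \mu_t} \right)$.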


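Proof:   We first note that, for each $t$,

(3.4)     $\sqrt{N} \begin{pmatrix} \hat Y_t - \Phi_t(\mu_t, \gamma_0) \\ \hat X_t - \mu_t \end{pmatrix} \xrightarrow{d} N(0,\ \kappa_t \Sigma_t)$,

independently across $t$, by application of the Lindeberg-Levy central limit
theorem. Since $\gamma_0$ is interior to $\Gamma$ and
$\hat\gamma_w \to \gamma_0$ a.s. by Theorem 3.1, the first-order condition

(3.5)     $0 = \sum_{t=1}^{T} \hat w_t \left( \hat Y_t - \Phi_t(\hat X_t, \hat\gamma_w) \right) \left. \frac{\partial \Phi_t}{\partial \gamma} \right|_{\hat X_t, \hat\gamma_w}$

must be satisfied for all $N$ suitably large by the definition of
$\hat\gamma_w$. Making a further mean-value expansion,

(3.6)     $0 = \sum_{t=1}^{T} \hat w_t \left[ \left( \hat Y_t - \Phi_t(\mu_t, \gamma_0) \right) - \left. \frac{\partial \Phi_t}{\partial \mu'} \right|_{\mu_t^*, \gamma_t^*} (\hat X_t - \mu_t) - \left. \frac{\partial \Phi_t}{\partial \gamma'} \right|_{\mu_t^*, \gamma_t^*} (\hat\gamma_w - \gamma_0) \right] \left. \frac{\partial \Phi_t}{\partial \gamma} \right|_{\hat X_t, \hat\gamma_w}$

for $\mu_t^*$, $\gamma_t^*$ a convex combination of $\hat X_t$, $\hat\gamma_w$
and $\mu_t$, $\gamma_0$. Since $\hat w_t$, $\hat X_t$ and $\hat\gamma_w$
converge a.s. to $w_t$, $\mu_t$, $\gamma_0$, and the matrix $M$ defined above
is nonsingular by A8 and A11, the expression above implies

(3.7)     $\sqrt{N}\,(\hat\gamma_w - \gamma_0) = M^{-1} \sum_{t=1}^{T} w_t \sqrt{N} \left[ \left( \hat Y_t - \Phi_t(\mu_t, \gamma_0) \right) - \left. \frac{\partial \Phi_t}{\partial \mu'} \right|_{\mu_t, \gamma_0} (\hat X_t - \mu_t) \right] \left. \frac{\partial \Phi_t}{\partial \gamma} \right|_{\mu_t, \gamma_0} + o_p(1)$

by the continuous differentiability of $\Phi_t$. The result then follows by
the usual properties of the multivariate normal distribution.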


3.2  Estimation  of  the  Asymptotic  Covariance  Matrix 


The proof of Theorem 3.2 shows that, for large $N$, the estimator
$\hat\gamma_w$ behaves like a weighted linear regression of
$\{\sqrt{N}\,[(\hat Y_t - \Phi_t(\mu_t, \gamma_0)) - (\partial \Phi_t / \partial \mu')(\hat X_t - \mu_t)]\}$
on the vectors $\{\partial \Phi_t(\mu_t, \gamma_0) / \partial \gamma\}$. If
$\kappa_t$ and $\nu_t' \Sigma_t \nu_t$ were known (or could be consistently
estimated as $N \to \infty$), an efficient choice of weights $\{\hat w_t\}$ (in
the Gauss-Markov sense) would satisfy

(3.8)     $\lim_{N \to \infty} \frac{\hat w_t}{\hat w_s} = \frac{\kappa_s\, \nu_s' \Sigma_s \nu_s}{\kappa_t\, \nu_t' \Sigma_t \nu_t}$.

In some circumstances, the matrices $\{\Sigma_t\}$ will be completely determined
by the unknown behavioral parameters $\gamma_0$ and the population characteristic
aggregates $\{\mu_t\}$ (e.g. in Example 2 of Section 2 above); in this situation,
construction of optimal weights $\hat w_t$ using $\hat\Sigma_t = \Sigma_t(X_t, \hat\gamma)$ is
straightforward.   In other cases, knowledge of the population aggregates
$\{Y(Q_t)\}$ and $\{X(P_t)\}$ will be insufficient to determine the elements of
$\{\Sigma_t\}$.   In the continuous demand example of Section 2, the covariance
matrices $\{\Sigma_t\}$ involve the unknown variance $\sigma^2$ of the unobserved
heterogeneity terms $\{u_{it}\}$; this parameter is clearly unidentified given
only the location parameters of the data, i.e., given only the population
means of $\{y_{it}\}$ and the population means and medians of $\{z_{it}\}$.   In
general, dependence of the distribution of $y_{it}$ on unknown nuisance
parameters not included in $\gamma_0$ inhibits not only construction of efficient
weights but also consistent (for T fixed) estimation of the covariance matrix
of $\hat\gamma_w$, which will depend upon these nuisance parameters (through $\Sigma_t$).

Consistent estimation of the covariance matrix of $\hat\gamma_w$ in this situation
does appear feasible if T, the number of aggregation groups, also tends to
infinity with N, the minimum group size.   While we do not explore the details
here, it seems reasonable to expect that the nuisance parameters in $\Sigma_t$ can
be consistently estimated using method-of-moments estimation applied to the
"between group" averages of squared values of the residuals
$\sqrt{N}(Y_t - \Phi_t(X_t, \hat\gamma_w)) = \hat e_t$, whose limiting distributions have
covariances involving $\Phi_t$, $\mu_t$, $\gamma_0$, and the nuisance parameters.
Alternatively, following Eicker (1967) and White (1980), we could estimate V,
the matrix given in the statement of Theorem 3.2, by

$$\hat V = \sum_{t=1}^{T} \hat w_t^2\, \hat e_t^2 \left.\frac{\partial \Phi_t}{\partial \gamma}\frac{\partial \Phi_t}{\partial \gamma'}\right|_{X_t,\hat\gamma_w} \tag{3.9}$$

for $\hat e_t$ defined above and $\{\hat w_t\}$ a given (not necessarily optimal) set of
weights.   Because the proofs of Theorems 3.1 and 3.2 used the assumption of
fixed T, the foregoing results cannot immediately be generalized to permit
$T \to \infty$; still, the results seem very likely to hold in this case, with
further regularity conditions, and we conjecture that, under such conditions,

$$\operatorname*{plim}_{T\to\infty}\ T^{-1}[\hat V - V] = 0 \tag{3.10}$$

provided $T/N \to 0$ as $T \to \infty$, so that large sample inference on $\gamma_0$ will be
feasible if the number of groups is large.
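
As a minimal sketch of how an estimate such as (3.9) might be put to use, the
following Python fragment computes an Eicker-White style variance estimate of
the "sandwich" form for the special case of a scalar behavioral parameter.
The function name, the argument layout, the made-up numbers, and the assumed
scalar form of M (the sum over t of the weight times the squared derivative)
are illustrative assumptions, not part of the results above; the residuals are
taken to be already scaled by the square root of N.

```python
def sandwich_variance(weights, residuals, gradients):
    """Robust variance estimate for a scalar parameter (illustrative only).

    weights   -- w_t, a given (not necessarily optimal) set of weights
    residuals -- e_t = sqrt(N) * (Y_t - Phi_t(X_t, gamma_hat))
    gradients -- g_t = dPhi_t/dgamma evaluated at (X_t, gamma_hat)
    """
    if not (len(weights) == len(residuals) == len(gradients)):
        raise ValueError("all inputs must have the same length T")
    # Assumed scalar form of M: sum_t w_t * g_t^2.
    m_hat = sum(w * g * g for w, g in zip(weights, gradients))
    if m_hat == 0.0:
        raise ValueError("M_hat is zero; the parameter is not identified")
    # V_hat in the spirit of (3.9): sum_t w_t^2 * e_t^2 * g_t^2.
    v_hat = sum((w * e * g) ** 2
                for w, e, g in zip(weights, residuals, gradients))
    return v_hat / (m_hat * m_hat)


# Hypothetical inputs for T = 4 groups, purely for illustration.
w = [1.0, 1.0, 2.0, 0.5]
e = [0.3, -0.5, 0.2, 0.8]
g = [1.0, 2.0, 1.5, 0.5]
variance = sandwich_variance(w, e, g)
```

With unit weights and residuals of equal magnitude the function collapses to
the familiar least squares expression, squared residual over the sum of
squared derivatives, which gives a quick sanity check on the implementation.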


3.3  Relation  to  Cross  Section  Regression 


It is clear from the above development that $v_t'\Sigma_t v_t$ (normalized by
$N^{-1}$) is the relevant "true" aggregate residual variance of departures of
$Y_t$ from $\Phi_t(X_t, \gamma_0)$.   In this section we point out how $v_t'\Sigma_t v_t$ can
be regarded as the residual variance from cross section linear regression,
which indicates how nonlinearities in micro behavior affect the size of the
aggregate residual variance.

For the remainder of this section we focus attention on a single time
period t.   Consider the individual i.i.d. influence terms $\xi_{it}$ and $\zeta_{it}$ of
assumption A10, for $i = 1, \ldots, N_t$ (or a smaller random sample), and define

$$S_{yy} = \sum_i (\xi_{it}^*)^2 / N_t, \qquad S_{xy} = \sum_i \zeta_{it}^* \xi_{it}^* / N_t, \qquad S_{xx} = \sum_i \zeta_{it}^* \zeta_{it}^{*\prime} / N_t$$

as the sample covariances, with

$$S = \begin{pmatrix} S_{yy} & S_{xy}' \\ S_{xy} & S_{xx} \end{pmatrix},$$

where $\xi_{it}^* = \xi_{it} - \bar\xi_t$ and $\zeta_{it}^* = \zeta_{it} - \bar\zeta_t$ are the deviations from
sample averages.   By a straightforward application of the (i.i.d.) Strong Law
of Large Numbers, we have that $\lim S = \Sigma_t$ a.s. under A1, A10 and A11.

We can obtain several interpretative results by considering cross
section linear equations of the form

$$\xi_{it}^* = b' \zeta_{it}^* + u_{it}, \qquad i = 1, \ldots, N_t, \tag{3.11}$$

where b could be considered as the slope coefficient of the regression of
$\xi_{it}$ on $\zeta_{it}$ which includes a constant term.   We denote the OLS coefficient
estimates as $\hat\beta_t = (S_{xx})^{-1} S_{xy}$, and
$\beta_t = \lim \hat\beta_t = (\Sigma_{xx,t})^{-1} \Sigma_{xy,t}$.   Of special interest is the large
sample residual variance from (3.11), which we denote as the almost sure limit
$a(b) = \lim \bigl( \sum_i (\xi_{it}^* - b'\zeta_{it}^*)^2 / N_t \bigr)$.   Clearly $a(b)$ is minimized at
$b = \beta_t$, and we denote the minimum as $a(\beta_t) = a_t$.

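The first interpretative result is available by direct computation:
namely, under assumptions A1, A10, A11,

$$a\!\left( \left.\frac{\partial \Phi_t}{\partial \mu}\right|_{\mu_t,\gamma_0} \right) = v_t' \Sigma_t v_t.$$

Consequently, $v_t'\Sigma_t v_t$ can be viewed as the cross section variance of the
difference $\xi_{it} - (\partial\Phi_t/\partial\mu')|_{\mu_t,\gamma_0}\, \zeta_{it}$, or as the large sample
residual variance of equation (3.11) where b is set to
$b = \partial\Phi_t/\partial\mu\,|_{\mu_t,\gamma_0}$, the macroeconomic effects.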
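
The algebra behind this interpretation can be checked numerically.   The sketch
below, written for a scalar influence term and entirely made-up data, forms
the sample covariances from deviations about sample means, evaluates the
quadratic form with the vector (1, -b), and compares it with the residual
variance of the centered regression equation at the same b.   All names and
numbers are hypothetical and serve only to illustrate the identity.

```python
def centered(values):
    """Return deviations of `values` from their sample mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def quadratic_form_and_residual_variance(xi, zeta, b):
    """Compare v'Sv, with v' = (1, -b), to the residual variance at slope b.

    xi, zeta -- scalar influence terms for one group (illustrative data)
    Returns (v'Sv, residual variance at b, OLS slope S_xy / S_xx).
    """
    n = len(xi)
    xi_c, zeta_c = centered(xi), centered(zeta)
    s_yy = sum(a * a for a in xi_c) / n
    s_xy = sum(z * a for z, a in zip(zeta_c, xi_c)) / n
    s_xx = sum(z * z for z in zeta_c) / n
    # Quadratic form v' S v expanded for the scalar case.
    quad = s_yy - 2.0 * b * s_xy + b * b * s_xx
    # Residual variance of the centered equation evaluated at slope b.
    resid_var = sum((a - b * z) ** 2 for a, z in zip(xi_c, zeta_c)) / n
    return quad, resid_var, s_xy / s_xx


xi = [1.0, 2.5, 2.0, 4.5, 5.0]      # hypothetical values
zeta = [0.0, 1.0, 2.0, 3.0, 4.0]    # hypothetical values
quad, resid_var, ols_slope = quadratic_form_and_residual_variance(xi, zeta, 0.7)
at_ols = quadratic_form_and_residual_variance(xi, zeta, ols_slope)[1]
```

The two quantities agree for any b, and the residual variance evaluated at
the OLS slope is never larger than at another slope, which is the sense in
which the least squares coefficient minimizes the residual variance.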


Of more interest is the condition under which $v_t'\Sigma_t v_t = a_t$, or that
$v_t'\Sigma_t v_t$ represents the (least squares) cross section residual variance of
$\xi_{it}$ regressed on $\zeta_{it}$.   In view of the above observation, an obvious
sufficient condition occurs when
$\lim \hat\beta_t = \beta_t = \partial\Phi_t/\partial\mu\,|_{\mu_t,\gamma_0}$, or when $\hat\beta_t$ consistently
estimates the macroeconomic effects.   This property is studied extensively by
Stoker (1982, 1983).   We state the following proposition, which easily
follows from the development of Stoker (1983).

Theorem 3.3:   Under Assumptions A1-A6, A10, and A11, a sufficient
condition for $v_t'\Sigma_t v_t = a_t$ occurs when the influence function $\zeta(x, \mu)$ can
be written as

$$\zeta(x, \mu) = B(\mu)\, \ell(x, \mu) \tag{3.12}$$

for all x and for $\mu$ in a neighborhood of $\mu_t$, where $B(\mu)$ is a nonsingular
matrix and $\ell(x, \mu)$ is the score vector

$$\ell(x, \mu) = \frac{\partial \ln p(x \mid \mu)}{\partial \mu}. \tag{3.13}$$

Condition (3.12) is valid (locally) if and only if the aggregate $X_t$ is a
(locally) efficient estimator of $\mu_t$.


Thus the true aggregate residual variance $v_t'\Sigma_t v_t$ coincides with the
linear cross section residual variance when $\zeta(x, \mu)$ is (locally) collinear
with the score $\ell(x, \mu)$.   In particular, when $B(\mu)$ is the inverse of the
information matrix $I_\mu = E(\ell\ell' \mid \mu)$, as in

$$\zeta(x, \mu) = (I_\mu)^{-1} \ell(x, \mu), \tag{3.14}$$

then we have $\lim \hat\beta_t = \beta_t = \partial\Phi_t/\partial\mu\,|_{\mu_t,\gamma_0}$.   Moreover, the score
condition is associated with the efficiency properties of the consistent
aggregate $X_t$, being globally valid when $X_t$ is efficient (and hence a
sufficient statistic) for $\mu$.   The condition holds locally if and only if $X_t$
is locally sufficient and efficient in a neighborhood of $\mu = \mu_t$.   While we do
not provide proof here, it is easy to see that if $X_t$ is consistent and
locally efficient for $\mu_t$, then the influence function can be written in the
restricted form (3.14).




The sufficient conditions of Theorem 3.3 are of interest because they
characterize the aggregate residual variance $v_t'\Sigma_t v_t$ as a cross section OLS
residual variance, regardless of the true form of the micro behavioral model
$y_{it} = f_t(z_{it}, u_{it})$.   Thus, when the aggregates $X_t$ are efficient estimators
of the parameters $\mu_t$, the aggregate residual variance $v_t'\Sigma_t v_t = a(\beta_t)$ does
not depend on whether the cross section OLS residuals $\xi_{it} - \beta_t'\zeta_{it}$ arise
from structural nonlinearities of $f_t$ in $z_{it}$ or from random disturbances.
While other sufficient conditions can be found, they will necessarily depend
on restrictions on the functional form of micro behavior
$y_{it} = f_t(z_{it}, u_{it})$, such as conditions which would imply that $\xi_{it}$ is a
linear function of $\zeta_{it}$ plus an independent disturbance.

For motivation, it may be useful to consider these interpretations when
the aggregates $X_t$ and $Y_t$ are simple averages of the respective micro
variables.   For $X_t = \bar X_t = \sum_i z_{it}/N_t$, (3.12) is valid (locally) if and
only if $p(z \mid \mu)$ is (locally) in exponential family form with driving
variables z:

$$p(z \mid \mu) = c(\mu)\, h(z) \exp[\pi(\mu)' z]. \tag{3.15}$$

In this case $\zeta_{it} = z_{it} - \mu_t$ and $\hat\beta_t$ is the OLS slope coefficient vector of
$\xi_{it}$ regressed on $z_{it}$.   When in addition $Y_t = \bar Y_t = \sum_i y_{it}/N_t$, we have
$\xi_{it} = y_{it} - \Phi_t(\mu_t, \gamma_0)$, so that $\hat\beta_t$ is the OLS slope coefficient of
$y_{it}$ regressed on $z_{it}$, and $a_t = v_t'\Sigma_t v_t$ is the large sample residual
variance from the regression.

4.   First-Order Equivalence of Weighted Least Squares and Minimum Distance
Estimation

In the previous section we discussed how the efficiency of the weighted
least squares estimator $\hat\gamma_w$ would be maximized through judicious choice of
weights $\{\hat w_t\}$.   In this section we compare the large sample behavior of
$\hat\gamma_w$ to a broader class of "minimum distance" estimators, such as the
maximum likelihood estimator $\hat\gamma_{ML}$ of (2.9) above.   Under appropriate
conditions, it will be shown that such estimators will be first order
equivalent to a weighted least squares estimator, with weights depending upon
the form of the minimand defining the estimator.   An interesting result is
that the corresponding estimators of $\mu_t$ need not be first order equivalent to
$X_t$, even though the estimator of $\gamma_0$ is first-order equivalent to some
$\hat\gamma_w$ which implicitly uses $\hat\mu_t = X_t$ in its construction.

Initially, we restrict attention to a narrower class of estimators than
one which includes maximum likelihood, namely, estimators of $\gamma_0$ and
$\mu = (\mu_1', \ldots, \mu_T')'$ defined by

$$(\hat\gamma, \hat\mu) = \arg\min_{\gamma,\,\mu} \sum_{t=1}^{T} v_t(Y_t, X_t, \gamma, \mu_t), \tag{4.1}$$

where the $\{v_t(\cdot)\}$ are fixed "distance functions" with properties to be
specified below.   The maximum likelihood estimator of (2.9) is not in this
class, since its "distance function" depends upon the group sample size $N_t$
as well as the group index t; however, we will extend our results to include
this case as well.

We impose the following conditions on the parameter vector $\mu$ and the
criterion functions $\{v_t\}$:

Assumption B1:   The vector $\mu$ of true characteristic aggregates is an
interior point of a compact parameter space M.

Assumption B2:   For each t and all $\gamma \in \operatorname{int} \Gamma$, $\mu \in \operatorname{int} M$,
$v_t(y, x, \gamma, \mu) = 0$ if $y = \Phi_t(\mu, \gamma)$ and $x = \mu$, and
$v_t(y, x, \gamma, \mu) > 0$ otherwise.

Assumption B3:   The $\{v_t\}$ are fixed functions which are twice
continuously differentiable in their arguments.

Assumption B4:   Under assumptions A1 through A12, the minimum distance
estimators $\hat\gamma$ and $\hat\mu_t$ are $\sqrt{N}$-consistent, i.e.,
$\sqrt{N}(\hat\gamma - \gamma_0) = O_p(1)$ and $\sqrt{N}(\hat\mu_t - \mu_t) = O_p(1)$.

The points $\{\mu_t\}$ are assumed to be interior points of M in B1 so that
Taylor's series arguments are applicable; compactness of M is not used in the
theorem which follows, but is needed to extend the results when the distance
functions depend upon $N_t$.   B2 makes more precise the sense in which $v_t(\cdot)$
reflects the "distance" between the population aggregates $(Y(Q_t), X(P_t))$ and
any arbitrary (y, x), while B3 permits criterion functions to be approximated
by a quadratic form in a neighborhood of the true values
$(y, x, \gamma, \mu) = (\Phi_t(\mu_t, \gamma_0), \mu_t, \gamma_0, \mu_t)$.   Finally, condition B4 restricts
attention to minimum distance estimators which are interesting alternatives to
$\hat\gamma_w$, ruling out superefficient or inconsistent estimators; the condition
is imposed in lieu of a list of assumptions which would imply it, since our
objective is only to show that such estimators are of weighted least squares
form asymptotically.

We note that $\hat\gamma_w$ and the $\{X_t\}$ are not in the class of minimum distance
estimators defined here unless the weights $\hat w_t$ are fixed and nonstochastic
and the functions $\Phi_t(\mu, \gamma)$ are twice differentiable, in which case

$$v_t(y, x, \gamma, \mu) = w_t\, \bigl(y - \Phi_t(x, \gamma)\bigr)^2 + \|x - \mu\|^2 \tag{4.2}$$

would generate $\hat\gamma = \hat\gamma_w$ and $\hat\mu_t = X_t$ in (4.1).


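Theorem 4.1:   Under A1 - A12 and B1 - B4, the minimum distance estimator
$\hat\gamma$ of $\gamma_0$ given by (4.1) is first-order equivalent to a weighted least
squares estimator,

$$\sqrt{N}(\hat\gamma - \hat\gamma_w) = o_p(1),$$

where $\hat\gamma_w$ is of the form (2.11) with weights proportional to

$$w_t = \frac{\partial^2 v_t}{\partial y^2} - \frac{\partial^2 v_t}{\partial y\, \partial \mu'} \left[ \frac{\partial^2 v_t}{\partial \mu\, \partial \mu'} \right]^{-1} \frac{\partial^2 v_t}{\partial \mu\, \partial y},$$

with the derivatives being evaluated at
$(y, x, \gamma, \mu) = (\Phi_t(\mu_t, \gamma_0), \mu_t, \gamma_0, \mu_t)$.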


Proof:   Let H denote the matrix of second partials of $v_t(y, x, \gamma, \mu)$
evaluated at $(\Phi_t(\mu_t, \gamma_0), \mu_t, \gamma_0, \mu_t)$.   Since
$v_t(\Phi_t(\mu, \gamma), \mu, \gamma, \mu) = 0$ identically in $\gamma$ and $\mu$, H must satisfy the
restrictions

$$\begin{pmatrix} \partial\Phi_t/\partial\gamma & 0 & I_l & 0 \\ \partial\Phi_t/\partial\mu & I_k & 0 & I_k \end{pmatrix} H = 0, \tag{4.3}$$

where the derivatives are evaluated at $\mu_t$ and $\gamma_0$, $k = \dim(\mu_t)$,
$l = \dim(\gamma_0)$, and $I_m$ is the identity matrix of order m.   We denote the
various component submatrices of H by a double subscript, e.g.,

$$H_{\mu y} = \left.\frac{\partial^2 v_t}{\partial \mu\, \partial y}\right|_{(\Phi_t(\mu_t, \gamma_0),\, \mu_t,\, \gamma_0,\, \mu_t)}, \quad \text{etc.}$$

With this notation, we observe that, with arbitrarily high probability,
the minimum distance estimators must satisfy the first-order condition

$$0 = \sum_{t=1}^{T} \frac{\partial v_t}{\partial \gamma}(Y_t, X_t, \hat\gamma, \hat\mu_t) \tag{4.4}$$

for N suitably large.   Making a mean-value expansion of the right-hand side of
(4.4) at the limiting values, we have

$$0 = \sum_{t=1}^{T} \Bigl[ H_{\gamma y} \sqrt{N}\bigl(Y_t - \Phi_t(\mu_t, \gamma_0)\bigr) + H_{\gamma x} \sqrt{N}(X_t - \mu_t) + H_{\gamma\gamma} \sqrt{N}(\hat\gamma - \gamma_0) + H_{\gamma\mu} \sqrt{N}(\hat\mu_t - \mu_t) \Bigr] + o_p(1), \tag{4.5}$$

where the convergence of the remainder term to zero in probability follows
from the continuity of the hessian matrix and the $\sqrt{N}$-consistency of
$Y_t$, $X_t$, $\hat\gamma$, and $\hat\mu_t$.   Now, using the restrictions given in (4.3),


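$$0 = \sum_{t=1}^{T} \frac{\partial \Phi_t}{\partial \gamma} \left\{ H_{yy} \left[ \sqrt{N}\bigl(Y_t - \Phi_t(\mu_t, \gamma_0)\bigr) - \frac{\partial \Phi_t}{\partial \gamma'} \sqrt{N}(\hat\gamma - \gamma_0) - \frac{\partial \Phi_t}{\partial \mu'} \sqrt{N}(X_t - \mu_t) \right] - H_{y\mu} \sqrt{N}(X_t - \hat\mu_t) \right\} + o_p(1), \tag{4.6}$$

where, as always, all derivatives are evaluated at the true values.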

To obtain an expression for the term involving $(X_t - \hat\mu_t)$, we use the
first-order conditions for $\hat\mu_t$,

$$0 = \sqrt{N}\, \frac{\partial v_t}{\partial \mu}(Y_t, X_t, \hat\gamma, \hat\mu_t) = H_{\mu y} \sqrt{N}\bigl(Y_t - \Phi_t(\mu_t, \gamma_0)\bigr) + H_{\mu x} \sqrt{N}(X_t - \mu_t) + H_{\mu\gamma} \sqrt{N}(\hat\gamma - \gamma_0) + H_{\mu\mu} \sqrt{N}(\hat\mu_t - \mu_t) + o_p(1), \tag{4.7}$$

which also must hold in probability for large N by B1 and B4.   Again using
the total-differential restrictions of (4.3), we can express the difference
$\sqrt{N}(X_t - \hat\mu_t)$ as

$$\sqrt{N}(X_t - \hat\mu_t) = H_{\mu\mu}^{-1} H_{\mu y} \left[ \sqrt{N}\bigl(Y_t - \Phi_t(\mu_t, \gamma_0)\bigr) - \frac{\partial \Phi_t}{\partial \mu'} \sqrt{N}(X_t - \mu_t) - \frac{\partial \Phi_t}{\partial \gamma'} \sqrt{N}(\hat\gamma - \gamma_0) \right] + o_p(1), \tag{4.8}$$

where the nonsingularity of $H_{\mu\mu}$ follows from the assumed $\sqrt{N}$-consistency
of $\hat\mu_t$.   Combining (4.5) and (4.7) yields


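$$0 = \sum_{t=1}^{T} \bigl[ H_{yy} - H_{y\mu} H_{\mu\mu}^{-1} H_{\mu y} \bigr] \frac{\partial \Phi_t}{\partial \gamma} \left[ \sqrt{N}\bigl(Y_t - \Phi_t(\mu_t, \gamma_0)\bigr) - \frac{\partial \Phi_t}{\partial \mu'} \sqrt{N}(X_t - \mu_t) - \frac{\partial \Phi_t}{\partial \gamma'} \sqrt{N}(\hat\gamma - \gamma_0) \right] + o_p(1). \tag{4.9}$$

Hence $\sqrt{N}(\hat\gamma - \gamma_0)$ can be written exactly in the form of equation (3.7)
above, with

$$w_t = \bigl[ H_{yy} - H_{y\mu} H_{\mu\mu}^{-1} H_{\mu y} \bigr], \tag{4.10}$$

which is nonnegative by B2 and B3.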
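
The restrictions (4.3) can be checked numerically for a distance function of
the form (4.2).   The following Python sketch does so by central finite
differences for scalar arguments; the aggregate function (the parameter times
the squared characteristic), the weight of 3, and the evaluation point are
hypothetical choices made only for illustration, and the finite-difference
Hessian is a numerical approximation rather than anything used in the proof.

```python
def phi(mu, gamma):
    # Hypothetical aggregate function, chosen only for illustration.
    return gamma * mu * mu


def distance(p, w=3.0):
    # Distance function of the form (4.2); p = (y, x, gamma, mu).
    y, x, gamma, mu = p
    return w * (y - phi(x, gamma)) ** 2 + (x - mu) ** 2


def hessian(f, p, h=1e-4):
    """Central finite-difference matrix of second partials of f at p."""
    n = len(p)

    def shifted(i, di, j, dj):
        q = list(p)
        q[i] += di
        q[j] += dj
        return f(q)

    return [[(shifted(i, h, j, h) - shifted(i, h, j, -h)
              - shifted(i, -h, j, h) + shifted(i, -h, j, -h)) / (4 * h * h)
             for j in range(n)] for i in range(n)]


mu0, gamma0 = 1.5, 0.8                      # assumed "true" values
point = [phi(mu0, gamma0), mu0, gamma0, mu0]
H = hessian(distance, point)

# Rows of the restriction matrix in (4.3), scalar case.
row_gamma = [mu0 * mu0, 0.0, 1.0, 0.0]          # (dPhi/dgamma, 0, 1, 0)
row_mu = [2.0 * gamma0 * mu0, 1.0, 0.0, 1.0]    # (dPhi/dmu, 1, 0, 1)
restriction_gamma = [sum(row_gamma[i] * H[i][j] for i in range(4))
                     for j in range(4)]
restriction_mu = [sum(row_mu[i] * H[i][j] for i in range(4))
                  for j in range(4)]
```

Both products are zero up to numerical error.   For this particular distance
function the cross partial in y and the last argument vanishes, so the weight
in the theorem reduces to the second partial in y, which is twice the weight
appearing in (4.2) and hence proportional to it.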


As noted above, Theorem 4.1 applies only to estimators with distance
functions which do not vary with $N_t$; thus, both the maximum likelihood and
weighted least squares estimators are excluded from this class (the latter
due to its dependence on estimated weights $\hat w_t$).   To extend the result of
Theorem 4.1 to the case in which the distance functions $v_{tN}(y, x, \gamma, \mu)$ are
(possibly stochastic) functions of $N_t$, we impose the following conditions:


Assumption B5:   The functions $\{v_{tN}\}$ are twice continuously differentiable
in their arguments (with probability one) for all $N_t$.

Assumption B6:   The minimum distance estimators based upon $\{v_{tN}\}$ are
$\sqrt{N}$-consistent.

Assumption B7:   The matrix of second partials of $v_{tN}$ converges to the
corresponding hessian of some function $v_t$ satisfying B2 and B3, uniformly in
$\gamma \in \Gamma$, $\mu$ and $x \in M$, and $y \in \{\Phi_t(\mu, \gamma): \mu \in M, \gamma \in \Gamma\}$, with probability one.

Corollary 1:   Under A1 - A12 and B1 - B7, the minimum distance estimator
$\hat\gamma$ based upon the distance functions $v_{tN}$ satisfies the conclusion of
Theorem 4.1.

Proof:   The expressions (4.5) and (4.7) above for $\hat\gamma$ and $\hat\mu_t$ are still
valid, since the hessian of $v_{tN}$ evaluated at $(Y_t, X_t, \hat\gamma, \hat\mu_t)$ converges in
probability to H by Lemma 4 of Amemiya [1973].

This corollary extends the result of Theorem 4.1 to a much broader class
of estimators, including the WLS estimator $\hat\gamma_w$ (which trivially satisfied
the conclusion of Theorem 4.1, and now satisfies the hypotheses B5 - B7 as
well) and the maximum likelihood estimator $\hat\gamma_{ML}$ of (2.9).   Strictly
speaking, application of the corollary to maximum likelihood estimation
requires that the log density, $\log d_t(Y_t, X_t \mid \gamma, \mu, N_t)$, can be transformed
into a distance function $v_{tN}(Y_t, X_t, \gamma, \mu)$ satisfying B5 - B7 through proper
normalization (e.g., multiplication by $-N^{-1/2}$); while we suspect such a
transformation is feasible for most applied problems, it is beyond the scope
of this paper to provide sufficient conditions for equivalence of maximum
likelihood and minimum distance estimation because of the generality of the
aggregates $Y_t$ and $X_t$ considered.

An interesting aspect of Theorem 4.1 is that the asymptotic
distributions of $\hat\mu_t$, the minimum distance estimators of the population
aggregates, and $X_t$, the empirical characteristic aggregate, need not
coincide.   Equation (4.8) implies

$$\sqrt{N}(X_t - \hat\mu_t) = \sqrt{N}\bigl(Y_t - \Phi_t(X_t, \hat\gamma)\bigr)\, a_t + o_p(1), \tag{4.11}$$

where $a_t = [\partial^2 v_t/\partial\mu\,\partial\mu']^{-1} [\partial^2 v_t/\partial\mu\,\partial y]$.   That is, the estimators
$\hat\mu_t$ may exploit the dependence of the distribution of $Y_t$ (as well as
$X_t$) on the population characteristics $\mu_t$ in their estimation.   For maximum
likelihood estimates, it is clear that $a_t$ will equal zero if $X_t$ is a
sufficient statistic for $\mu_t$, as it will, for example, if $X_t = \bar X_t$ and the
distributions of the characteristics $x_{it}$ are in the exponential family given
in (3.15); the log likelihood function will then decompose into the sum of the
conditional log likelihood of $Y_t$ given $X_t$ and $\gamma$ (which will not depend on
$\mu_t$ by sufficiency) and the marginal log density of $X_t$ given $\mu_t$.   In
general, though, when $X_t$ is not (locally) sufficient for $\mu_t$ (that is, when
the $\zeta_{it}$ are not the scores corresponding to maximum likelihood estimation of
$\mu_t$ based upon cross-section data), we may expect the optimal $a_t$ to differ
from zero for estimation of the population characteristic vector.

Finally, we note that the results in this and the previous section do
not depend upon the special form of the empirical aggregates $Y_t = Y(\hat Q_t)$
and $X_t = X(\hat P_t)$ assumed here; all that is required for Theorems 3.1, 3.2,
and 4.1 is the consistency and asymptotic normality (of order $\sqrt{N}$) of the
aggregates $Y_t$ and $X_t$ about $\Phi_t(\mu_t, \gamma_0)$ and $\mu_t$ respectively.   Thus, for
example, the results will still hold if $Y_t$ is constructed from a stratified
sample of the $\{y_{it}\}$, with stratification on the basis of the
characteristics $\{x_{it}\}$.   We have chosen to focus on the more restrictive
forms of $Y_t$ and $X_t$ because they represent the most common type of aggregate
available, that is, simple averages or percentiles from marginal tabulations
of $\{y_{it}\}$ and $\{x_{it}\}$.   It should be clear that more efficient aggregates
could be computed given the original micro-data and knowledge of the
micro-model (Assumption A1); in such a case, though, no aggregation problem
would be present, and the usual results on efficiency of maximum likelihood
estimation of $\gamma_0$ would apply.

5.   Conclusion 

The theorems given in sections 3 and 4 above provide the theoretical
basis for large-sample inference on microeconomic behavioral parameters using
empirical aggregates.   Viewed in a broader context, the results of this paper
indicate that the most difficult step in constructing and implementing models
of aggregate data (that properly account for the problem of aggregation over
individuals) is the construction of the appropriate aggregate function
through integration (when the aggregates are sample means) or more general
determination of the population aggregates $Y(Q_t)$ and $X(P_t)$ as functions of
the micro parameters.   Once the aggregate function is formulated, parameters
can be consistently estimated by just inserting aggregate data and performing
least squares.   Efficiency gains are potentially available by optimally
weighting observations, as indicated above; moreover, using the optimal
weights exhausts all of the (first order) efficiency gains available.   Given
the validity of the conjectures of section 3.2, hypothesis tests for the
micro parameters can be performed in entirely standard fashion.

While we have indicated the generality of the framework, several
extensions are of interest for future work.   With regard to applications with
time series aggregate data, a natural question is how to incorporate
autocorrelated individual stochastic terms into estimation, as suggested by
the standard interpretation of the disturbances as unobserved attributes of
individual agents (for example, demographic effects which are not explicitly
modeled).   It is clear that this complication would only affect the
efficiency discussions above, with the consistency of the estimators proposed
here not affected.   A second extension concerns how to incorporate the
endogeneity of predictor variables at the micro and/or macro level, which is
indicated for problems where simultaneous equations modeling and/or
expectations modeling is important.   It should be noted here that our results
clearly would be applicable to the estimation of reduced form equations in
such circumstances.